19 research outputs found

Unsupervised Natural Language Processing for Knowledge Extraction from Domain-specific Textual Resources

Author: Hänig Christian
Publication date: 17/04/2013

This thesis aims to develop a Relation Extraction algorithm to extract knowledge from automotive data. While most approaches to Relation Extraction are evaluated only on newspaper data dealing with general relations from the business world, their applicability to other data sets is not well studied. Part I of this thesis deals with the theoretical foundations of Information Extraction algorithms. Text mining cannot be seen as the simple application of data mining methods to textual data. Instead, sophisticated methods have to be employed to accurately extract knowledge from text, which can then be mined using statistical methods from the field of data mining. Information Extraction itself can be divided into two subtasks: Entity Detection and Relation Extraction. The detection of entities is very domain-dependent due to terminology, abbreviations and general language use within the given domain. Thus, this task has to be solved for each domain by employing thesauri or another type of lexicon. Supervised approaches to Named Entity Recognition will not achieve reasonable results unless they have been trained for the given type of data. The task of Relation Extraction can basically be approached with pattern-based and kernel-based algorithms. The latter achieve state-of-the-art results on newspaper data and point out the importance of linguistic features. In order to analyze relations contained in textual data, syntactic features like part-of-speech tags and syntactic parses are essential. Chapter 4 presents the machine learning approaches and linguistic foundations that are essential for syntactic annotation of textual data and Relation Extraction. Chapter 6 analyzes the performance of state-of-the-art algorithms for POS tagging, syntactic parsing and Relation Extraction on automotive data. The findings are that supervised methods trained on newspaper corpora do not achieve accurate results when applied to automotive data. There are various reasons for this. Besides low-quality text, the nature of automotive relations poses the main challenge. Automotive relation types of interest (e.g. component – symptom) are rather arbitrary compared to well-studied relation types like is-a or is-head-of. In order to achieve acceptable results, algorithms have to be trained directly on this kind of data. As the manual annotation of data for each language and data type is too costly and inflexible, unsupervised methods are the ones to rely on.
Part II deals with the development of dedicated algorithms for all three essential tasks. Unsupervised POS tagging (Chapter 7) is a well-studied task, and algorithms achieving accurate tagging exist. None of them disambiguates high-frequency words; only out-of-lexicon words are disambiguated. Most high-frequency words bear syntactic information, and thus it is very important to differentiate between their different functions. Domain languages in particular contain ambiguous high-frequency words bearing semantic information (e.g. pump). In order to improve POS tagging, an algorithm for disambiguation is developed and used to enhance an existing state-of-the-art tagger. This approach is based on context clustering, which is used to detect a word type's different syntactic functions. Evaluation shows that tagging accuracy is raised significantly. An approach to unsupervised syntactic parsing (Chapter 8) is developed in order to meet the requirements of Relation Extraction. These requirements include high-precision results on nominal and prepositional phrases, as they contain the entities relevant for Relation Extraction. Furthermore, accurate shallow parsing is more desirable than deep binary parsing, as it facilitates Relation Extraction more than deep parsing does. Endocentric and exocentric constructions can be distinguished, which improves phrase labeling. unsuParse detects phrase candidates based on the preferred positions of word types within phrases. Iterating the detection of simple phrases successively induces deeper structures. The proposed algorithm fulfills all demanded criteria and achieves competitive results on standard evaluation setups.
Syntactic Relation Extraction (Chapter 9) is an approach exploiting syntactic statistics and text characteristics to extract relations between previously annotated entities. The approach is based on entity distributions given in a corpus and thus provides a possibility to extend text mining processes to new data in an unsupervised manner. Evaluation on two different languages and two different text types of the automotive domain shows that it achieves accurate results on repair order data. Results are less accurate on internet data, but the task of sentiment analysis and extraction of the opinion target can be mastered. Thus, the incorporation of internet data is possible and important, as it provides useful insight into the customer's thoughts. To conclude, this thesis presents a complete unsupervised workflow for Relation Extraction (except for the highly domain-dependent Entity Detection task) that improves the performance of each of the involved subtasks compared to state-of-the-art approaches. Furthermore, this work applies Natural Language Processing methods and Relation Extraction approaches to real-world data, unveiling challenges that do not occur in high-quality newspaper corpora.
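
The abstract mentions context clustering as the basis for separating a word type's different syntactic functions. The following is only a minimal illustrative sketch of that general idea, not the algorithm from the thesis: the feature design (one left and one right neighbour), the greedy grouping rule, the `min_shared` threshold and all function names are assumptions made for this example.

```python
from collections import Counter


def context_features(tokens, i):
    """Return left/right neighbour features for the token at position i."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {f"L:{left}", f"R:{right}"}


def cluster_contexts(sentences, target, min_shared=1):
    """Greedily group occurrences of `target` that share context features.

    Hypothetical simplification: an occurrence joins the first cluster with
    which it shares at least `min_shared` features, else it opens a new one.
    """
    clusters = []  # each: {"features": Counter, "members": [(sent_idx, tok_idx)]}
    for s_idx, sentence in enumerate(sentences):
        tokens = sentence.lower().split()
        for t_idx, token in enumerate(tokens):
            if token != target:
                continue
            feats = context_features(tokens, t_idx)
            for cluster in clusters:
                shared = sum(1 for f in feats if f in cluster["features"])
                if shared >= min_shared:
                    break  # reuse the matching cluster
            else:
                cluster = {"features": Counter(), "members": []}
                clusters.append(cluster)
            cluster["features"].update(feats)
            cluster["members"].append((s_idx, t_idx))
    return clusters


# Toy data: "pump" used after a determiner and after a pronoun.
sentences = ["the pump failed", "the pump leaks", "we pump fuel", "we pump oil"]
clusters = cluster_contexts(sentences, "pump")
for n, cluster in enumerate(clusters):
    print(n, cluster["members"], sorted(cluster["features"]))
```

On the toy sentences, the occurrences after "the" and those after "we" end up in separate groups, which is the kind of distinction a tagger could then treat as two different word functions. A realistic system would need richer context statistics and a proper clustering method.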

Qucosa - Publikationsserver der Universität Leipzig

LUPO Landesumweltportale als modularisierte, verteilte Anwendung (English: LUPO state environmental portals as a modularized, distributed application)

Author: Greceanu Claudia
Grieß Christina
Hänig Kai
Koch Lars
Niemeier Rüdiger
Schillinger Wolfgang
Schlachter Thorsten
Schmitt Christian
Tauber Martina
Publication venue: KIT Scientific Publishing, Karlsruhe
Publication date: 01/01/2018

KITopen

UniHI 4: new tools for query, analysis and visualization of the human protein–protein interactome

Author: Bader
Breitkreutz
Brown
Chaurasia
Christian Hänig
Erich E. Wanker
Futschik
Futschik
Gautam Chaurasia
Goehler
Jenny Russ
Joshi-Tope
Kanehisa
Kerrien
Lage
Lehner
Liu
Matthias E. Futschik
Mishra
Persico
Ramani
Rual
Salwinski
Sigrid Schnoegl
Soniya Malhotra
Stelzl
Su
Publication venue: Oxford University Press
Publication date: 01/01/2009

Human protein interaction maps have become important tools of biomedical research for the elucidation of molecular mechanisms and the identification of new modulators of disease processes. The Unified Human Interactome database (UniHI, http://www.unihi.org) provides researchers with a comprehensive platform to query and access human protein–protein interaction (PPI) data. Since its first release, UniHI has considerably increased in size. The latest update of UniHI includes over 250 000 interactions between ∼22 300 unique proteins collected from 14 major PPI sources. However, this wealth of data also poses new challenges for researchers due to the complexity of interaction networks retrieved from the database. We therefore developed several new tools to query, analyze and visualize human PPI networks. Most importantly, UniHI now allows the construction of tissue-specific interaction networks and focused querying of canonical pathways. This will enable researchers to target their analysis and to prioritize candidate proteins for follow-up studies.

Crossref

PubMed Central

Plymouth Electronic Archive and Research Library

Sapientia

MDC Repository

UniHI: an entry gate to the human protein interactome

Author: Bader
Brown
Chaurasia
Christian Hänig
Erich E. Wanker
Eyre
Futschik
Gautam Chaurasia
Gavin
Giot
Goehler
Gunsalus
Hanspeter Herzel
Ito
Joshi-Tope
Kasprzyk
Lehner
Li
Lim
Matthias E. Futschik
Mrowka
Peri
Persico
Ramani
Rual
Salwinski
Stelzl
Uetz
von Mering
Yasir Iqbal
Publication venue: Oxford University Press
Publication date: 07/12/2006

Systematic mapping of protein–protein interactions has become a central task of functional genomics. To map the human interactome, several strategies have recently been pursued. The generated interaction datasets are valuable resources for scientists in biology and medicine. However, comparison reveals limited overlap between different interaction networks. This divergence obstructs usability, as researchers have to interrogate numerous heterogeneous datasets to identify potential interaction partners for proteins of interest. To facilitate direct access through a single entry gate, we have started to integrate currently available human protein interaction data in an easily accessible online database. It is called UniHI (Unified Human Interactome) and is available at . At present, it is based on 10 major interaction maps derived by computational and experimental methods. It includes more than 150 000 distinct interactions between more than 17 000 unique human proteins. UniHI provides researchers with a flexible integrated tool for finding and using comprehensive information about the human interactome.
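
The abstract describes integrating several interaction maps and notes their limited overlap. As a rough illustration of what such an integration step could look like, the sketch below merges toy interaction lists and measures their pairwise overlap. It is not UniHI's actual pipeline: the map names, protein identifiers, function names and the choice of the Jaccard index as overlap measure are all assumptions for this example.

```python
from itertools import combinations


def normalize(pairs):
    """Treat interactions as undirected: (A, B) and (B, A) are the same edge."""
    return {frozenset(pair) for pair in pairs}


def merge_maps(maps):
    """Map each distinct interaction to the set of source maps reporting it."""
    merged = {}
    for name, pairs in maps.items():
        for edge in normalize(pairs):
            merged.setdefault(edge, set()).add(name)
    return merged


def pairwise_overlap(maps):
    """Jaccard overlap (shared edges / all edges) for every pair of maps."""
    edges = {name: normalize(pairs) for name, pairs in maps.items()}
    result = {}
    for a, b in combinations(sorted(edges), 2):
        union = edges[a] | edges[b]
        result[(a, b)] = len(edges[a] & edges[b]) / len(union) if union else 0.0
    return result


# Hypothetical toy maps with placeholder protein identifiers.
maps = {
    "map_a": [("P1", "P2"), ("P2", "P3")],
    "map_b": [("P2", "P1"), ("P3", "P4")],
}
merged = merge_maps(maps)
overlap = pairwise_overlap(maps)
print(len(merged), "distinct interactions")
print(overlap)
```

Keeping the set of supporting sources per interaction makes it possible to filter later for interactions confirmed by more than one map.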

Crossref

PubMed Central

Plymouth Electronic Archive and Research Library

MDC Repository

Unsupervised Natural Language Processing for Knowledge Extraction from Domain-specific Textual Resources

Author: Hänig Christian
Publication date: 17/04/2013

Qucosa

HSSS - Hochschulschriftenserver der SLUB

Qucosa - Publikationsserver der Universität Leipzig

Knowledge-free Verb Detection through Sentence Sequence Alignment

Author: Hänig Christian
Publication date: 10/05/2011

Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa. NEALT Proceedings Series, Vol. 11 (2011), 291-294. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/1695

DSpace at Tartu University Library

Unsupervised Natural Language Processing for Knowledge Extraction from Domain-specific Textual Resources

Author: Hänig Christian

HSSS - Hochschulschriftenserver der SLUB

Modular Classifier Ensemble Architecture for Named Entity Recognition on Low Resource Systems

Author: Bordag Stefan
Hänig Christian
Thomas Stefan
Publication date: 25/11/2014

This paper presents the best-performing Named Entity Recognition system in the GermEval 2014 Shared Task. Our approach combines semi-automatically created lexical resources with an ensemble of binary classifiers which extract the most likely tag sequence. Out-of-vocabulary words are tackled with semantic generalization extracted from a large corpus and an ensemble of part-of-speech taggers, one of which is unsupervised. Unknown candidate sequences are resolved using a look-up with the Wikipedia API.

University of Hildesheim